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ABSTRACT 

We  present  a  view  of  the  likelihood  ratio  (LR)  gradient  estimation  technique  (also  called 
the  score  function  (SF)  method^  under  which  infinitesimal  perturbation  analysis  (IPA) 
can  be  viewed  as  a  (degenerate)  special  case,  by  selecting  appropriately  what  the  random 
component  u>  effectively  represents.  Varying  the  actual  meaning  of  u>  (i.e.  defining  the 
underlying  sample  space  in  different  ways)  might  define  different  variants  of  the  LR  method, 
some  of  them  mixing  IPA  with  more  traditional  LR.  We  illustrate  this  by  many  examples. 
We  also  give  general  conditions  under  which  the  gradient  estimators  are  unbiased. 
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1.  The  LR  gradient  estimation  technique 


Consider  a  stochastic  simulation  model  parametrized  by  a  real  vector  0  €  of  continuous 
parameters,  and  suppose  we  want  to  estimate  the  gradient  Va(0)  of  the  (differentiable) 
expected  value  a(0)  of  some  real-valued  objective  function.  Two  techniques  have  been 
proposed  recently  to  estimate  such  a  gradient  by  simulation:  infinitesimal  pertubation 
analysis  (IPA)  [5,  6,  7,  10,  12,  19]  and  likelihood  ratios  (LR)  [8,  14,  15,  16]. 

The  basic  idea  of  LR  is  that  a(d)  can  usually  be  viewed  as  the  expectation  of  some 
function  of  6  and  of  the  "sample  path”  u>,  say  h(0,  with  respect  to  a  probability  measure 
Pe(-)  over  some  measurable  space  (0,  E).  Here,  Q  is  the  sample  space,  and  u;  6  fl  represents 
all  the  "random  elements”  in  the  simulation,  so  that  when  it  is  fixed,  the  evolution  of  the 
system  becomes  deterministic.  More  specifically,  we  assume  that  h(0,  •)  is  E-measurable. 
Usually,  one  cannot  differentiate  this  expectation  directly  by  differentiating  inside  the 
integral,  and  one  of  the  reasons  is  that  Pe(‘)  typically  depends  on  6.  That  dependence  can 
be  eliminated  if  one  can  take  a  probability  measure  G(-)  on  the  same  measurable  space, 
independent  of  6 ,  and  that  dominates  the  P*(*)’b  for  0  in  the  region  of  interest.  In  that 
case,  one  can  rewrite: 

«(»)  =  jf  *(».<•>)*#(<*•>)  =  /„  G(<M  =  ja  (1) 

where  27(0, u;)  =  h(0,<jj)Pe(<L>)/G(dv).  The  ratio  Pe(cL>) / G(du>)  is  called  the  Radon- 
Nikodym  derivative  of  Pe(-)  with  respect  to  G(-).  By  the  Radon-Nikodym  theorem  ([1], 
theorem  32.2),  it  exists  and  (1)  is  valid  if  and  only  if  G(-)  dominates  Pe(-),  i.e.  iff  Pg(-)  is 
absolutely  continuous  with  respect  to  G(-),  i.e.  iff  for  every  measurable  set  B,  G(B)  =  0 
implies  P$(B)  =  0.  Note  that  if  sampling  is  done  using  G(*),  the  bracketted  term  in  (1)  can 
be  used  as  an  estimator  of  a(0).  This  "change  of  measure”  approach  is  called  importance 
sampling,  and  is  often  used  as  a  variance  reduction  technique  [9,  16, 17]. 

Under  appropriate  regularity  conditions  (that  permit  to  interchange  the  derivative  and 
expectation),  one  can  differentiate  q(-)  by  differentiating  the  bracketted  term  with  respect 
to  0  inside  the  integral: 

Va{0)  =  /  iJ>(0,iif)G(dw).  (2) 

J  n 

where 

*(»,<■>)  =  V,S(«,«)  =  (3) 

Sufficient  regularity  conditions  are  given  in  [5,  6,  7, 14, 16]  for  special  cases.  More  general 
conditions  are  given  in  section  3  of  this  paper.  The  need  for  these  regularity  conditions 
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and  for  the  existence  of  the  Radon-Nikodym  derivative  certainly  impose  some  limitations 
on  the  method  (see  the  examples  in  section  4),  but  there  are  also  many  practical  cases 
where  it  works  well. 

When  (2)  holds,  ^(0,u>)  can  be  used  to  estimate  Va(0).  Note  that  only  one  simulation 
experiment  (using  <?(-))  is  required  to  estimate  the  gradient.  In  principle,  can 

be  evaluated  at  any  value  of  0  for  which  (2)  holds,  permitting  to  estimate  of  the  gradient 
everywhere  by  a  single  simulation.  But  the  variance  of  the  gradient  estimator  is  sometimes 
dramatically  high  for  some  values  of  0  (see  e.g.  [15]).  The  method  can  also  be  generalized 
to  higher  order  derivatives  (see  [14,  15]). 

How  do  we  choose  (?(•)  ?  Among  those  G(*)  for  which  (1-2)  hold,  one  would  like  to 
choose  one  for  which  the  variance  is  low.  But  this  is  not  always  easy  to  do.  In  practice, 
generating  values  of  u>  according  to  G(-)  and  computing  V’(^,  w)  for  any  generated  u>  should 
also  be  done  easily  and  at  reasonably  low  cost.  This  usually  limits  the  set  of  interesting 
choices. 


To  estimate  Va(0o)  at  a  single  point  0O,  an  easy  choice  for  G(-)  (when  it  is  admissible) 
is  Pg „(■).  In  that  case,  things  simplify  in  (3)  and  we  obtain: 


Va(*0)  =  f 
J  n 


(Veh(0,  w)  +  h(0,u/)Vg  lnPg(du;)) 


e=e0i 


Pe0(dw). 


(4) 


The  expression  Vg\nPg(du)  is  called  the  score  function  (SF).  The  LR  gradient  estima¬ 
tor  has  been  introduced  [8,  14,  15]  as  the  bracketted  expression  in  (4),  sometimes  in  a  less 
general  setting.  In  this  paper,  we  take  the  following  definition. 

DEFINITION  1.  A  LR  gradient  estimator  for  Va(0)  is  one  that  is  defined  by  the 
bracketted  expression  in  (4),  wh  .e  w  obeys  Pg0,  provided  that  this  expression  exists  for 
almost  all  w,  and  fc(0,  •)  is  a  measurable  function  of  u >  whose  expectation  is  a(0).  I 

The  estimator  defined  in  (3)  can  be  viewed  as  a  combination  of  LR  with  importance 
sampling.  It  is  also  a  generalization  of  LR  (see  also  [16]). 

We  have  not  talked  much  yet  about  the  choice  of  the  sample  space  D,  but  this  is  one  of 
the  key  points  in  this  paper.  In  fact,  for  a  given  simulation  model,  there  can  be  different 
ways  of  defining  the  sample  space  and  the  meaning  of  u>.  In  many  simulations,  all  random 
variables  are  generated  by  first  generating  If (0,1)  variates,  and  transforming  them  in  the 
appropriate  way.  Hence,  u>  can  be  viewed  as  a  sequence  of  independent  lf(0, 1)  variates, 
and  the  value  of  the  objective  function  h(0t  w)  is  a  deterministic  function  of  this  sequence. 
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But  this  is  only  one  way  of  viewing  it.  In  fact,  there  is  no  need  to  assume  that  17(0,1) 
variates  are  used  to  drive  the  simulation  in  the  first  place.  For  instance,  the  LR  technique 
can  be  used  to  estimate  the  gradient  not  only  for  a  simulation  model,  but  also  for  a  real 
system  (provided  P$(-)  is  known  and  can  be  observed).  In  that  case,  u>  will  usually  not 
be  a  sequence  of  C7 (0, 1)  variates.  Consider  for  instance  a  M/G/l  queue.  One  possibility 
is  to  view  ui  as  the  sequence  of  interarrival  and  service  times,  and  define  fl  =  [0,  oo)°°  as 
the  sample  space.  Note  that  in  that  case,  the  distribution  of  u>  depends  on  0,  whereas  it 
does  not  when  u>  is  defined  as  a  sequence  of  17(0, 1)  variates. 

So,  there  might  be  different  ways  of  defining  the  sample  space  (and  the  associated 
probability  space  (Cl,  S,  P$(-))).  Different  choices  may  lead  to  different  gradient  estimators, 
some  being  more  efficient  than  others.  We  will  come  back  to  this  point  in  the  next  section. 

Some  might  feel  more  confortable,  for  some  reason,  with  an  underlying  probability 
space  in  which  the  “basic”  random  element  is  a  sequence  of  independent  17(0, 1)  variates. 
Let  (fi,E,P(-))  be  such  a  space.  If  (D,E ,P*(*))  ^  (fi,E,P(*)),  we  assume  that  there  is  a 
measurable  transformation  <f>$  :  Cl  —*  Cl  such  that  u>  =  <£$(u>),  and  such  that  Pg( u;  €  •)  = 
P(w  €  4>g  *(•))  Note  that  w  may  contain  less  information  than  u>,  and  that  its  probability 
law  may  depend  on  9.  We  can  also  define  the  P( ^-measurable  function  h(9,  •)  by  h(9,u>)  = 
h(9,<j>g(u>))  (note  that  h(9,  •)  is  Pj(-)-measurable).  We  have 

a(9)=  [  h(9,w)Pe(<Lj)  =  [  h(9,u)P(dw). 

J  n  Jn 

Note  that  this  notion  of  an  underlying  sample  space  Cl  and  transformation  <f>g  is  not  really 
necessary.  We  introduced  it  here  just  to  clarify  some  links  with  the  common  practice  (in 
simulation)  of  viewing  the  sample  space  that  way.  What  we  really  have  in  mind  in  to  just 
define  the  sample  space  as  Cl  and  forget  about  Cl. 

Usually,  for  convenience,  w  is  defined  as  a  sequence  of  independent  (univariate)  ran¬ 
dom  variables  that  represent  the  stochastic  aspects  of  the  simulation:  u>  =  (u>i,u>i, . . .). 
Pe(<L>)/<b>  is  then  a  product  of  univariate  density  or/and  probability  mass  functions,  and 
Vg\nPe{<Lj)  is  a  sum  whose  number  of  terms  is  typically  the  number  of  such  univari¬ 
ate  functions  that  depend  on  9.  Assuming  that  the  variances  of  the  sample  performance 
h(9,  u>)  and  gradient  Vgh(9,u>)  are  bounded,  the  variance  of  the  LR  gradient  estimator 
increases  linearly  (in  general)  with  that  number  (which  is  typically  a  linear  function  of  the 
simulation  length).  From  this  reasoning,  we  should  expect  LR  to  work  much  better  for 
terminating  simulations  for  which  only  a  small  number  of  random  variates  are  generated 
with  probability  laws  that  depend  on  9. 
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Note  that  since  regenerative  simulations  can  be  analyzed  in  a  way  very  similar  to  ter¬ 
minating  simulations  [8,  9],  the  above  remark  also  applies  to  steady-state  regenerative 
simulations  for  which  a  small  number  of  0-dependent  variates  are  generated  per  regen¬ 
erative  cycle.  A  version  of  the  LR  method  specially  adapted  for  regenerative  systems  is 
presented  in  [8]. 
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2.  Choosing  what  u  should  represent 


Suppose  that  u>  is  defined  as  a  sequence  of  independent  17(0, 1)  variates.  Then,  P»{-)  is 
independent  of  9,  H(9,w)  =  h(0,u>),  and  the  last  term  in  (3)  vanishes.  In  that  case, 
equation  (4)  becomes  (under  the  appropriate  regularity  conditions  and  with  9  =  0O)-' 

Va(9)  =  /  V8h(9,u,)P6(du>)  (5) 

Jn 

and  V$h(0,u;)  is  the  usual  IPA  gradient  estimator  ([2,  5,  6,  12,  16]). 

DEFINITION  2.  An  IPA  estimator  for  Va(0)  is  defined  as  V<ih(^,u;)  (provided  that 
this  quantity  exists  for  almost  all  u>),  where  is  a  sequence  of  independent  (7(0, 1)  random 
variables,  and  h(9,  •)  is  a  measurable  function  of  u>  whose  expectation  is  a(0).  I 

When  (5)  is  satisfied,  we  have  an  unbiased  IPA  gradient  estimator.  Note  that  there 
often  exists  different  functions  h(9,  •)  that  satisfy  (1)  and  (5),  and  thus  different  unbiased 
IPA  gradient  estimators  for  the  same  a(-).  In  practice,  the  function  h(9,  •)  is  usually  defined 
or  implied  by  the  simulation  model. 

The  basic  idea  of  IPA  is  to  generate  a  sample  path  u>,  viewed  as  a  sequence  of  17(0, 1) 
variates,  and  for  fixed,  observe  the  effect  of  an  infinitesimal  perturbation  on  9  (around 
90)  by  propagating  it  over  the  sample  path,  assuming  that  the  sequence  of  events  do  not 
change,  and  that  the  events  only  “slide”  in  time.  The  gradient  estimation  is  taken  as  the 
gradient  of  the  objective  function  for  that  fixed  value  of  u>.  Note  that  the  propagation 
rules  permit  to  evaluate  Vgh(9,  w)  only  at  9  =  90,  and  thus  to  estimate  the  gradient  only 
at  90.  If  that  definition  of  u>  is  used  in  (2)  with  G(- )  ^  P8(’),  one  gets  a  combination  of 
IPA  with  importance  sampling. 

According  to  the  above  definitions,  IPA  can  be  viewed  as  a  special  case  of  LR.  One 
big  advantage  of  IPA  is  that  since  no  component  of  u;  depends  on  9,  the  variance  does 
not  increase  any  more  with  the  simulation  length.  But  the  function  h(9,u>)  must  absorb 
all  the  tranformations  and  may  become  overly  complex,  sometimes  making  the  actual 
computation  of  Vgh(9,u>)  intractable,  or  invalidating  (2).  In  fact,  a  large  part  of  the  IPA 
literature  deals  with  the  development  of  effective  techniques  to  compute  Vgh(9,u>)  during 
the  simulation  (see  e.g.  [10,  18]).  One  might  even  associate  the  term  IPA  more  with  these 
techniques  than  with  equation  (5),  and  these  “IPA”  techniques  can  be  used  to  implement 
LR  as  well.  Some  people  might  also  argue  that  since  the  likelihood  ratio  has  disappeared 
in  (5),  this  is  no  more  LR.  In  fact,  we  just  view  it  as  a  degenerate  case. 
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As  we  said  before,  u>  can  also  represent  something  else  than  a  sequence  of  U(Q,l) 
variables.  For  example,  u;  can  be  viewed  as  representing  the  whole  history  of  the  system, 
including  all  events  with  their  types  and  occurence  times,  etc.  In  many  cases,  u;  can  carry 
enough  0-dependent  high  level  information  so  that  for  a  given  value  of  u ;,  h(B,  u>)  does 
not  depend  any  more  on  0,  and  the  first  term  of  the  right  hand  side  expression  in  (3) 
vanishes.  But  then,  one  might  be  unable  to  write  down  P$(-)  explicitly,  preventing  the 
actual  computation  of  the  estimate.  For  this  reason,  w  will  usually  be  taken  as  a  sequence 
of  independent  random  variables.  In  Glynn  [8],  for  example,  when  simulating  a  Markov 
chain,  u;  is  taken  as  the  sequence  of  states  visited  by  the  chain. 

For  the  extreme  case  where  u>  is  defined  as  the  value  of  the  objective  function  itself, 
i.e.  h(0,u>)  =  u>  (and  Cl  =  1R)  by  definition,  P${-)  is  actually  the  distribution  function  of 
the  cost.  When  we  can  write  it  down  and  write  down  the  likelihood  ratio,  there  is  usually 
no  need  to  simulate,  since  Va(0)  can  be  computed  directly.  Monte-Carlo  methods  are 
precisely  useful  for  the  cases  where  we  cannot  efficiently  compute  the  expression  directly. 

Between  these  extremes,  there  is  often  different  other  possibilities.  For  instance,  if  a 
set  of  17 (0, 1 )  values  must  go  through  many  levels  of  transformation,  one  may  choose  any 
one  of  the  levels  to  define  u.  Also,  u>  might  contain  the  original  17(0, 1)  values  for  some  of 
the  generated  random  variables,  and  the  transformed  values  for  others.  This  gives  rise  to 
hybrid  methods,  “mixing”  in  some  way  IPA  with  LR.  According  to  our  definitions,  this  is 
still  LR.  In  section  4,  we  give  examples  for  which  one  might  think  that  LR  does  not  apply, 
but  for  which  LR  effectively  applies  if  the  sample  space  and  u;  are  defined  appropriately. 

But  what  is  the  best  way,  then,  to  define  u;  ?  There  is  no  easy  answer  to  this  question. 
There  is  no  straightforward  recipe.  Of  course,  one  would  like  (2)  to  be  valid.  There  are 
examples  for  which  (2)  is  valid  if  u  represents  higher  level  information,  like  e.g.  the  set 
of  actual  interarrival  times,  service  times  and  transitions  between  nodes  in  a  queueing 
network,  and  not  valid  if  u;  represents  the  sequence  of  17(0,1)  variates.  But  for  other 
examples,  the  opposite  is  true  (see  [14,  5,  6]  and  the  examples  in  section  4).  In  certain 
situations,  (2)  might  be  valid  for  none  of  the  extreme  cases,  but  for  some  intermediate 
definition  of  u>  (see  example  4.5).  If  it  is  valid  for  many  possible  definitions  of  u>,  one  will 
then  try  to  minimize  the  variance.  This  is  certainly  problem- dependent,  but  from  the  last 
two  paragraphs  of  the  previous  section,  trying  to  put  the  least  number  of  0-dependent 
components  in  w  appears  to  be  a  good  strategy. 

One  consequence  of  the  above  discussion  is  that  many  properties  of  the  IPA  method 
also  apply  to  LR,  and  vice-versa.  For  example,  the  validity  of  interchanging  the  derivative 
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and  expectation  is  a  problem  for  the  LR  method  in  general.  Various  (problem  dependent) 
devices  have  been  suggested  to  “smooth  out”  or  transform  some  problems  for  which  IPA 
doesn’t  apply  directly,  into  problems  for  which  IPA  will  work  correctly  (see  e.g.  [10]  and 
the  references  in  [5,  6,  12]).  In  principle,  one  could  think  of  developping  such  devices  for 
LR  in  general. 

Note  that  for  the  case  where  a(0)  is  a  steady- state  performance  measure  and  u>  contains 
an  infinite  sequence  of  ^-dependent  random  variables,  the  Radon-Nikodym  derivative  in  (1) 
typically  does  not  exists.  However,  (5)  might  be  valid  in  that  case,  and  then,  one  typically 
has  Vgh($, uf)  =  Vct(0)  with  probability  one  (this  is  when  IPA  is  strongly  consistent). 
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3.  Interchanging  the  Derivative  and  Expectation 


In  this  section,  we  give  sufficient  conditions  for  the  interchange  of  derivative  and  expecta¬ 
tion  leading  to  (2)  to  be  valid.  Conditions  for  specific  cases  are  also  given  in  [5,  6,  7,  14, 
16],  and  are  sometimes  more  direct  to  verify.  But  often,  the  conditions  below  can  be  verify 
directly,  as  we  will  see  in  the  examples  of  the  next  section. 

Note  that  each  component  of  the  gradient  can  be  dealt  with  separately.  For  t  =  1 , . . . ,  d, 
to  study  the  i-th  component  of  the  gradient,  we  look  at  what  happens  when  only  the 
component  *  of  9  is  allowed  to  change  and  all  other  components  of  9  are  fixed.  To  simplify 
things,  in  this  section,  we  assume  that  d  =  1.  For  the  more  general  case,  just  apply 
the  results  below  to  each  component  of  9  (while  the  other  components  are  fixed).  All 
probabilistic  statements  in  this  section  are  made  with  respect  to  the  probability  measure 
(?(•).  The  lemma  below  is  an  adaptation  of  lemma  1  in  [5].  It  uses  the  following  assumption: 

Al.  Let  d  =  1.  There  is  a  neighbourhood  T  of  9  such  that  for  almost  all  u>,  H(-,u>)  exists 
and  is  continuous  over  T,  and  is  differentiable  everywhere  in  D(u>)  C  T,  where  T  \  D(u>) 
is  at  most  a  denumerable  set.  Assume  that  (1)  is  satisfied  for  all  9  g  T.  Also, 

sup  Mv,u>)|  (6) 

v£D(u>) 

is  integrable  with  respect  to  G(-)  (note  that  D(u>)  is  the  set  where  exists).  | 

LEMMA  1.  Under  Al,  equation  (t)  it  valid. 


PROOF.  The  proof  is  largely  inspired  by  the  proof  of  lemma  1  in  [5] .  From  a  generalized 
version  of  the  mean  value  theorem  (see  e.g.  theorem  8.5.2  in  [3]),  if  9  and  9  +  1 1  are  in  T 
and  w  satisfies  the  requirements  of  Al, 


H(9  +  h,u>)-  H{9,  u>) 
h 


<  sup  |V’(v,u>)|. 

v£D(ui) 


Hence,  from  the  dominated  convergence  theorem, 


=  Ja(umB{e  +  h’w)h  <?(*.) 

fc-o  \  h  ) 

=  Va{9).  I 
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4.  Examples 


In  this  section,  we  give  a  number  of  examples  to  illustrate  the  main  ideas  of  the  paper.  For 
some  of  them,  we  also  give  numerical  results.  The  first  five  examples  deal  with  a  simple 
M/G/l  queue  that  evolves  until  a  certain  (fixed)  number  of  departures  have  occured.  The 
next  to  examples  consider  the  lifetime  of  a  4-out-of-W  reliability  system,  without  repairs. 
The  following  one  discuss  a  general  continuous  time  Markov  chain,  while  the  last  one  looks 
at  sensitivity  with  respect  to  thresholds.  In  particular,  we  look  at  replacement  policies 
defined  by  thresholds  in  a  multicomponent  system.  In  the  latter  case,  we  actually  don’t 
know  how  LR  can  be  used  efficiently  to  estimate  the  gradient. 

Consider  a  M/G/l  queue,  initially  empty,  and  let  a(0)  be  the  expected  mean  system 
time  (waiting  +  service  times)  for  the  first  T  customers  in  the  system,  where  9  is  a  param¬ 
eter  of  the  servive  time  distribution.  The  arrival  rate  is  A  =  1.  We  want  to  estimate  the 
derivative  a'(9 )  at  a  given  point  6  =  0Q  by  simulating  at  that  point.  For  a  given  realization 
u>,  h(6,v)  represents  the  observed  average  waiting  time  for  the  T  customers.  We  have 

M  W=?EWtSf)  (?) 

4  «=2 

where  W,  and  5,  are  respectively  the  actual  (observed)  waiting  time  and  service  time  of 
customer  t  (these  are  deterministic  functions  of  9  and  u>).  Let  A,  denotes  the  interarrival 
time  between  customers  i  —  1  and  i  (Aj  is  the  arrival  time  of  customer  1  and  the  system 
starts  at  time  0).  We  have  W\  =  0,  and  Wl+1  =  max(0,  W,  —  A^+1  +  5.)  for  t  >  0.  The 
first  five  examples  below  are  variants  of  this  one;  only  the  service  time  distributions  differ. 
Application  of  IPA  to  this  system  has  been  analyzed  in  [19]  when  the  objective  function 
is  the  steady-state  average  system  time  per  customer.  It  has  been  shown  that  under 
some  conditions  on  the  service  time  distribution,  IPA  gives  an  asymptotically  unbiased 
and  strongly  consistent  gradient  estimate.  For  the  case  of  a  finite  number  of  customers 
(terminating  simulation),  the  validity  of  IPA  has  been  analyzed  e.g.  in  [5],  example  4. 

4.1.  A  M/M/1  queue 

Let  the  service  time  distribution  be  exponential  with  mean  6,  a  <  9  <  f>,  where  0  <  a  <  b. 
IPA  is  known  to  work  for  that  case:  assuming  that  the  interarrival  and  service  times  are 
generated  by  inversion,  one  can  take  u>  as  the  sequence  of  U(0, 1)  values  used  to  generate 
them,  i.e.  w  =  (U\,. . .  ,U2t)>  A.x  =  — ln(l  —  \J2i _i)  and  5,  =  —  01n(l  —  U2t).  Here, 
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G(-)  =  P$(-)-  An  infinitesimal  perturbation  on  5,  affects  the  system  time  of  customer  t 
and  of  all  the  customers  that  follow  him  in  the  same  busy  period  (if  any).  Therefore, 

1  t  flc 

T  .=1J6B.  96 

where  Bi  is  the  set  containing  customer  i  and  all  the  customers  that  precede  him  in  the  same 
busy  period  (if  any),  and  dSi/89  =  Si/6.  This  can  be  computed  during  the  simulation  as 
described  in  [19].  We  can  easily  verify  assumption  Al.  In  fact,  for  any  u>,  S,  is  continuous 
and  differentiable  in  6 ,  and  h(-,u>)  is  continuous  in  the  5,’s  (and  in  the  WVs,  which  are 
continuous  in  the  5,’s).  Also,  h(-,u>)  fails  to  be  differentiable  at  9  only  when  two  events 
(arrival  or  departure)  occur  simultaneously,  and  this  happens  at  most  for  a  finite  number 
of  values  of  9.  Since  supfl>a  r/>(9,u>)  is  clearly  integrable,  lemma  1  applies  and  IPA  provides 
an  unbiased  estimate  for  that  case. 

Another  choice  is  to  take  u  as  the  set  of  actual  interarrival  and  service  times:  u;  = 
(Ai,  Si, ... ,  At,  St)-  In  this  case,  Pg(du;)/cL>  is  the  product  of  their  densities: 

X 

P6(dw)  =  n  (-e-We-A'dSidA^J  , 

G(<Lj)  =  PgQ(<Lj),  and  Veh(9,u>)  =  0.  Note  that  only  the  service  time  densities  appear  in 
the  likelihood  ratio,  since  the  interarrival  times  are  independent  of  9  (in  fact,  taking  either 
the  actual  interarrival  times  or  the  corresponding  C/- (0, 1 )  values  in  u>  makes  no  difference 
here).  One  has 

'J'  r\  'J' 

VslnP,(&.)  =  E^ln  E(*  ~  »)■ 

This  can  be  computed  easily  together  with  h(9,u>)  during  the  simulation.  For  any  u>, 
is  continuous  and  differentiable  in  [a, b]  (note  that  H(9,u>)  depends  on  9  only 
throught  Pg ( duj ) ) .  Also,  the  gradient  estimator  is 

m*)  =  /p  (ew + 5.))  (d*  -  *)) 

and  supfi6[a  6]  |V»(^,u;)|  is  Pg0 -integrable,  so  that  lemma  1  applies. 

In  principle,  one  can  also  combine  the  two  approaches  and  take  5,  for  some  of  the 
customers  and  the  corresponding  U( 0,1)  values  for  others.  This  would  give  rise  to  more 
complex  expressions  but  could  be  implemented  in  practice  without  too  much  difficulty, 
and  lemma  1  would  still  apply.  There  might  be  no  practical  advantage  of  doing  such  a 
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combination  in  that  case,  but  there  are  other  examples  where  it  can  be  helpful.  For  the 
case  where  u>  contains  only  the  set  of  waiting  times,  one  faces  the  problem  of  expressing 
Ps(-).  In  fact,  the  waiting  times  are  dependent  random  variables  whose  distributions  are 
quite  complex  in  general.  Except  for  small  values  of  T,  this  is  not  practical.  Finally,  for 
the  extreme  case  where  u?  =  h(9,u),  one  has  to  compute  the  distribution  of  the  average 
cost  using  numerical  methods! 


4.2.  Discrete  law  with  0-dependent  value  of  the  probability  mass 

Let  0  <  a  <  6,  0  <  0  <  1,  and  suppose  that  the  service  time  is  b  with  probability  9,  and 
a  with  probability  1  —  6.  In  this  case,  IPA  doesn’t  apply  (see  also  [19]),  but  if  u;  contains 
the  set  of  actual  service  times,  then  LR  do  apply. 

Suppose  are  the  17(0,1)  variates  used  to  generate  the  service  times.  Let 

C;  =  1  and  Sy  =  bxiU,  <9,  C,  =  0  and  S,  =  a  otherwise  (C,  is  Bernouilli  (9)).  For  IPA, 
u>  contains  (Ulf. . . ,  Uj),  and  #(•,«)  =  h(-,u>)  is  discontinuous  since  S,  jumps  from  b  to  a 
at  9  =  U{.  This  is  why  IPA  doesn’t  work.  But  suppose  u  =  (Ai,  Si, . . . ,  At,  St).  In  this 
case,  the  likelihood  ratio  can  be  expressed  in  terms  of  the  variables  Ci,...,Cj-  Their  joint 
probability  mass  is 

pe(cu...,cT)  =  l[9c'(i-ey-c', 

1=1 

and  H(9,w)  —  K(u)pg(C\, . . . ,  Cr),  where  K(w)  =  h(9,u>)/pg0(Ci, . . .  ,Cj)  do  not  depends 
on  9  (because  u;  contains  ail  the  information  to  compute  h{9,  u>)  independently  of  6).  H{-,u) 
is  continuous  and  differentiable  on  (0, 1),  and  since  K(uj)  is  integrable  and  pg(-)  <  1,  Al  is 
satisfied  and  we  get  an  unbiased  estimate. 


4.3.  Discrete  law  with  0-dependent  support 

Suppose  that  the  service  time  is  9  with  probability  p,  and  29  with  probability  1  —  p,  where 
9  >  0  is  the  parameter  and  p  is  a  constant,  0  <  p  <  1.  Here,  the  unaive”  application 
of  LR,  where  u>  contains  the  set  of  actual  service  times  and  (?(■)  =  P$0{-),  doesn’t  apply 
because  there  is  no  neighbourhood  of  0O  in  which  the  Radon-Nikodym  derivative  exists 
(H(9,  u>)  =  0  everywhere  except  at  9  —  90,  and  so  it  is  discontinuous).  For  IPA,  use 
UU...,UT  to  generate  the  service  times:  Si  =  9  if  U{  <  p,  Si  =  29  otherwise.  ip(9,u >)  can 
je  computed  as  in  (8),  again  with  9Si/d9  =  Si/ 9.  The  arguments  to  verify  Al  are  the 
same  as  in  example  1,  and  so  IPA  applies. 
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4.4.  A  mixture  of  IPA  and  LR 


Let  q  be  a  constant,  0  <  q  <  1,  and  suppose  that  the  service  time  is  generated  as  in 
example  4.2  with  probability  q,  and  as  in  example  4.3  with  probability  1— q.  Let  A> . . . ,  Dt 
be  the  corresponding  Bernouilli  (q)  random  variables,  i.e.  A  =  1  if  the  service  time  of  the 
t-th  customer  is  generated  from  the  first  distribution  (example  4.2),  A  =  0  otherwise.  In 
that  case,  we  can  define  u>  as  the  set  of  values  of  A,  the  values  of  C,  for  the  customers  i  for 
which  A  =  I?  and  all  the  U( 0,1)  values  used  to  generate  the  rest  (the  interarrival  times 
and  the  other  5,’s).  When  A  =  0,  define  C,  =  0*  Since  the  likelihood  ratio  will  depend 
on  u>  only  throught  the  A  and  A,  let  us  define  u>  =  (A ,  C\ , . . . ,  Dt ,  Cj)-  The  probability 
mass  of  0)  is  given  by 

Pe(*)  =  II  9D,(l  ~  q)l'Dt  (0C‘(l  ~  ey~C')D' . 

«=i 


and  one  has 


A(A  -  0) 


This  can  be  computed  with  h(0,  u>)  during  the  simulation.  To  compute  V*h(0,  u>)  for  a  fixed 
w,  one  applies  a  mixture  of  the  more  traditional  IPA  and  LR  techniques:  the  service  times 
of  the  customers  for  which  A  =  0  are  “perturbed”  using  the  usual  IPA  technique,  while 
the  perturbations  for  the  other  service  times  are  considered  to  be  zero.  More  specifically, 
for  a  fixed  u>,  Vfl/i(0,u;)  is  computed  using  the  right  hand  side  of  (8),  but  with 


0  if  A  =  l; 

1  if  A  =  0  and  5,  =  0; 

2  if  Di  =  0  and  S,  =  20. 

Again,  assumption  A1  is  easily  verified,  by  combining  the  arguments  of  the  two  previous 
examples.  Therefore,  by  mixing  IPA  with  LR,  we  obtain  an  unbiased  gradient  estimate, 
despite  the  fact  that  neither  IPA  alone  nor  “naive”  LR  (putting  all  the  5,’s  in  u>)  works. 


Table  1  gives  the  results  of  a  numerical  experiment  for  this  example.  We  used  T  —  10, 
q  =  p  =  a  =  1/2,  b  =  3/2,  and  estimated  the  derivative  at  0  =  0.2, 0.5  and  0.9.  We  used  two 
gradient  estimation  techniques:  symmetric  finite  differences  with  common  random  numbers 
(FDC),  and  the  “hybrid”  method  described  above  (LR).  For  FDC,  simulations  were  made 
at  0  ±  0.01,  starting  from  the  same  (empty)  state,  and  the  same  t/(0, 1 )  values  were  used 
on  both  sides,  with  proper  synchronization.  In  each  case,  we  made  100000  replications 
and  computed  a  95%  confidence  interval.  Note  that  the  same  streams  of  random  numbers 
were  used  for  the  six  different  entries  of  table  1.  As  expected,  the  results  from  the  two 
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techniques  agree.  They  also  agree  with  the  exact  values,  which  were  computed  using 
dynamic  programming.  [To  compute  these  exact  values,  we  can  write  recursive  equations 
to  compute  V„(s)  and  V$V'n(s)  (in  terms  of  V’n+1(-)  and  V^Vn+1(‘)),  where  Vn(s)  represents 
the  expected  total  system  time  spent  from  now  on  by  the  next  T  —n  customers  to  depart, 
given  that  there  are  s  customers  in  the  system,  1  <  s  <  T  -  n,  one  of  which  is  beginning 
its  service.  Vj^x  is  the  expected  service  time,  and  V0{l)/T  is  the  expected  average  system 
time  for  the  first  T  customers.]  Note  that  for  FDC,  the  gradient  estimator  has  some  bias, 
due  to  the  finite  differences,  but  here,  that  bias  is  ulost  in  the  noise”,  since  the  confidence 
intervals  cover  the  exact  values. 


0 

True  grad. 

LR 

FDC 

0.2 

2.551 

2.53  ±  .04 

2.55  ±  .03 

0.5 

3.980 

3.96  ±  .06 

4.01  ±  .04 

0.9 

5.525 

5.53  ±  .16 

5.53  ±  .05 

Table  1:  Numerical  results  for  example  4: 

95%  confidence  intervals  for  the  gradient,  based  on  105  replications. 


4.5.  9  times  a  Bernoulli!  (0) 

Suppose  that  the  service  time  is  kO  with  probability  0,  and  0  with  probability  1  —  0,  for 
some  constant  k  >  0.  For  each  service  time,  one  can  generate  a  U(0, 1)  variate  U,,  and 
define  S,  =  kO  if  Ui  <  0,  Si  =  0  otherwise.  In  that  case,  it  is  easy  to  see  that  neither 
w  =  (A\,Ux,...,Aj,Ut),  nor  u  =  (Ai,Sj,. . . ,At,Sj)  will  work.  However,  if  one  takes 
=  (Ai  At,  Ct),  where  C,  =  Si/(kO),  then  LR  works.  In  fact,  C,  is  Bernouilli  ( 0 ). 

The  score  function  is 

and  Vgh(0,  u>)  can  be  computed  using  the  right  hand  side  of  (8),  with  dSi/90  =  kC{  =  Si/0 
(because  5,  =  kOC„  and  because  for  u>  fixed,  C,  is  fixed). 

For  this  example,  we  made  the  same  numerical  experiment  as  for  the  previous  example, 
with  k  =  2,  and  the  results  appear  in  table  2.  Again,  they  agree  very  well  with  the  exact 
values  (computed  by  dynamic  programming). 
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9 

True  grad. 

LR 

FDC 

i 

1.038 

1.03  ±  .01 

1.03  ±  .01 

m 

4.529 

4.49  ±  .05 

4.51  ±  .04 

\  0.9 

16.671 

16.61  ±  .30 

16.75  ±  .12 

Table  2:  Numerical  results  for  example  5: 

95%  confidence  intervals  for  the  gradient,  based  on  10s  replications. 


4.6.  Lifetime  of  a  fc-out-of-iV  system 


Consider  a  k-out-of-N  reliability  system  with  identical  components.  The  N  components 
have  independent  (random)  lifetimes  durations  JTi,... , Xs,  each  with  distribution  F$(-). 
For  simplicity,  assume  that  this  distribution  is  continuous,  with  density  /#(•),  and  that  for 
each  *  >  0,  both  Fg(x)  and  fe{x)  are  differentiable  w.r.t.  9.  The  system  is  down  (failed) 
when  less  than  k  components  are  still  alive.  For  a  given  realization  u>,  h(B,u>)  represents  the 
system’s  lifetime,  and  a(9)  its  expectation.  We  will  examine  four  of  the  (many)  possible 
choices  for  u>.  We  want  to  estimate  Va(0o),  using  (4).  Let  s  and  X,  denote  the  number  and 
lifetime  of  the  last  component  that  fails  (the  (N  -  k  +  l)-th  failure).  One  has  h(9,u>)  =  Xs. 

One  can  take  a;  =  (Xi , . . . ,  Xs),  in  which  case  Vgh(0,  u>)  =  0, 

*(».<■>) = x.  n 

i=l  j6 

and 

V„  In  P((du>)  =  f;^  Id /,(*,). 

This  is  the  most  straightforward  application  of  LR. 

A  second  choice  is  to  take  u;  as  a  and  X,,  plus  the  set  A  of  components  that  are  still 
alive  at  time  Xt.  During  the  simulation,  the  N  lifetimes  can  be  generated  as  above  (as 
usual),  and  V«h(0,  w)  =  0,  but  now, 


P,(<L,)/<fc,  =  U(X.)  no  -  F,(X.))  n  F.(X.), 

i£A  ifAu{t} 


and  the  score  function  is 


Vj  In  Pe(dw)  =  ^  L, 

i=i 
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where 


J^ln/nPr,)  if  >  =  j; 

^ln(l  ~F,(X,))  if  *  €  A; 

Q 

—  In  F$(  X, )  otherwise. 

do 

A  third  choice  is  to  use  the  IP  A  technique:  suppose  X<  =  Fg  1  ( Ut)  where  the  [7,’s 
are  independent  17(0,1)  variates,  for  i  =  1 ,  and  take  u>  =  (t/j, . . . ,  I7/v).  Here, 

Vgh(0,u)  =  8Fg'(U,)/ 86,  where  s  is  defined  as  above,  and  V$ln.Ps(du>)  =  0.  Note  that 
there  are  many  cases  where  we  can  obtain  equivalent  results  by  taking  u  as  something 
else  than  the  U,'s.  Consider  for  example  the  Weibull  distribution:  F(x)  =  1  —  exp(Ax°) 
and  F~*(l J)  =  (  — Aln(l  —  U ))^a.  In  this  case,  one  can  take  u  =  (Vi,...,VJv)  where 
Vi  =  —  ln(l  —  Ui)  and  this  is  equivalent  to  IPA  (for  9  =  A  or  8  =  a).  The  expression  for 
Vgh(6,uj)  can  be  obtained  with  slightly  less  manipulations  in  the  latter  case. 

For  exponential  lifetimes,  there  is  the  following  fourth  choice.  Let  I/O  be  the  failure 
rate  for  each  component,  and  let  the  state  of  the  system  be  defined  as  the  number  of 
components  that  are  still  alive.  That  system  evolves  as  a  continuous  time  Markov  chain. 
It  goes  from  N  to  N  -  1, . . . ,  to  k,  and  finally  to  fc  -  1  where  it  dies.  The  jump  rate  from 
N  —  i  to  N  —  i  —  1  is  (N  —  i)/6.  Hence,  w  can  be  defined  asw  =  (Yjv, . . . ,  Fit),  where  Y:  is 
exponential  with  mean  9/ j,  and  h(9, u>)  =  Yn  +  •  •  •  +  Y*.  One  has  in  this  case 

and  Vgh(0,v)  =  0.  Note  also,  that  for  the  exponential  case,  analytical  formulas  are 

obtained  readily:  a(0)  =  E[h(8,tj>)]  =  E[Ys  H - I-  Ft]  =  0(1/JVH - f  1/fc),  and  Va(0)  = 

(1  /N  H — -•  +  1/k).  For  many  other  distributions,  analytical  formulas  can  also  be  obtained 
by  exploiting  the  fact  that  X ,  is  an  order  statistic.  Often,  one  can  write  down  the  density 
(or  prob.  mass)  and  the  expectation  of  X ,  explicitly,  and  differentiate.  This  is  what  we 
did  to  compute  the  exact  gradient  for  the  Weibull  case  in  table  4  below. 

All  this  can  be  adapted  easily  to  more  general  reliability  networks,  with  more  complex 
structures,  components  that  have  different  lifetime  distributions  (with  possibly  different 
parameters),  repair  possibilities,  etc.  For  the  last  case  (exponential  lifetimes  and  fourth 
choice  of  w),  the  state  description  of  the  Markov  chain  would  get  more  complicated  in 
general.  A  “practical”  analytical  formula  is  not  always  available  for  the  exponential  case, 
but  replacing  transition  times  by  their  expectations  is  certainly  a  good  idea  (see  also 
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Exponential 

Weibull 

/*(*) 

(1 19)t-‘l\  x  >  0 

X9xe-1e~Xx\  x  >  0 

fi(*) 

1  -  e~x/e 

1 

Kb 

V 

H 

^X  for  X  =  F,~<(U) 

as 

-tfln(l  -  U) 

(— ln(l  —  U)/X)1'9 

X/9 

-X(\nX)/9 

(*  -  O)/#1 

1/9  +  (1  —  Axs)lnx 

Q 

—xt~*!6 

Axslnxe-Ar* 

89  lnFe{z) 

9*(  1  -  e~*/e) 

1  -  e~Xx> 

x/P 

— Axslnx 

Table  3:  Expressions  used  in  the  gradient  estimators, 
for  two  distributions. 

example  7).  In  practice,  importance  sampling  (when  it  works)  can  also  be  very  effective 
when  simulating  such  networks  (see  [17]). 

We  made  some  numerical  experiments,  first  with  the  exponential  distribution  with  mean 
9:  Fe(x)  =  1  —  exp(— x/0),  then  with  a  Weibull  distribution  with  9  as  the  form  parameter: 
Fe(x)  =  1  —  exp(— xs),  x  >  0.  We  tried  different  values  of  N  and  k.  The  results  appear 
in  table  4,  under  the  form  of  95%  confidence  intervals  on  Va(0).  The  “methods”  LRl, 
LR2,  IPA  and  MC  correspond  to  the  four  choices  of  u >  described  above,  in  the  same  order. 
Note  that  MC  does  not  apply  for  the  Weibull  distribution,  but  for  the  seven  other  cases, 
A1  can  be  verified  easily  for  all  9  >  0.  We  leave  that  as  an  (easy)  exercise  to  the  reader. 
Expressions  used  in  the  estimators  are  given  in  table  3  for  these  two  distributions. 

From  these  numerical  results,  LR2  appears  to  be  generally  better  than  LRl,  and  MC 
is  not  much  better  than  even  LRl.  The  most  efficient  by  far  is  certainly  IPA. 

4.7.  A  density  with  parameter-dependent  support 

In  the  previous  example,  suppose  that  the  component  lifetimes  are  uniformly  distributed, 
between  0  and  9.  The  density  of  Xi  is  /$(x)  =  1/0  for  0  <  x  <  9.  Since  the  support  of  /*(•) 
depends  on  9 ,  if  we  take  u>  =  (Xt,. . . ,  Xjv),  the  Radon-Nikodym  derivative  Ps(-)/Pe0(-) 
does  not  exists  at  9  >  90.  For  9  <  9 o ,  it  exists  in  a  neighbourhood  of  9 ,  and  (1)  is  valid. 
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N 

■ 

distrib. 

0 

True  grad. 

LRl 

LR2 

IPA 

MC 

8 

2 

Expon. 

1.0 

■B 

1.694±.042 

1.705±.036 

1.718±.004 

1.732±.040 

8 

6 

Expon. 

1.0 

-Egg 

0.430±.011 

0.436±.009 

0.436±.001 

0.440±.009 

8 

2 

Expon. 

5.0 

mum 

1.694±.042 

1.705±.036 

1.718±.004 

1.732±.040 

20 

5 

Expon. 

1.0 

1.485±.048 

1.483±.041 

1.513±.002 

1.523±.045 

20 

10 

Expon. 

1.0 

0.755±.025 

0.764±.020 

0.769±.001 

0.771±.020 

20 

15 

Expon. 

1.0 

imsm 

0.340±.012 

0.347±.008 

0.347±.001 

0.350±.008 

50 

10 

Expon. 

1.0 

1.669±.024 

1.674db.020 

1.671±.001 

1.650±.022 

50 

40 

Expon. 

1.0 

tag 

0.246±.004 

0.246±.002 

0.246±.001 

0.246±.002 

1 

1 

Weibull 

1.0 

1  BsSi 

-0.404±.033 

-0.404±.033 

-0.420±.010 

8 

2 

Weibull 

1.0 

-1.027±.057 

-1.052±.041 

-1.074±.008 

8 

6 

Weibull 

1.0 

Bill 

0.308±.012 

0.296±.004 

0.294±.001 

20 

5 

Weibull 

1.0 

1 

-0.618±.063 

-0.669±.031 

-0.682±.004 

20 

10 

Weibull 

1.0 

■n 

0.197±.030 

0.169i.007 

0.167±.001 

20 

15 

Weibull 

1.0 

mm 

0.350±.014 

0.342±.006 

0.340±.001 

Table  4:  95%  confidence  intervals  for  the  gradient, 
based  on  10s  replications  (example  5). 

One  may  think  of  using  that  to  estimate  the  gradient  at  0,  but  the  problem  is  that  since 
fe{x)  =  1/6  for  x  <  8  and  0  for  x  >  6 ,  the  continuity  assumption  on  27(-,u>)  does  not 
holds.  In  fact,  for  any  neighbourhood  of  8 ,  is  discontinuous  in  that  neighbourhood 

whenever  max,  Xi  is  in  it.  Therefore,  A1  is  not  satisfied.  This  illustrates  the  fact  that  even 
when  Vgh(0,u>)  =  0,  the  existence  of  the  Radon-Nikodym  derivative  is  not  a  sufficient 
condition  for  LR  to  apply. 

Note  that  IPA  applies  for  this  case:  the  gradient  estimator  is  X,/8 ,  the  same  as  for  the 
exponential  case. 

4.8.  A  continuous-time  Markov  chain 

Consider  a  continuous  time  Markov  chain  with  finite  state  space  5.  Let  A,  denote  the 
jump  rate  out  of  state  i,  and  pt]  be  the  transition  probability  from  *  to  j.  There  is  also  a 
cost  incurred  continuously  at  rate  c,  when  in  state  t.  Suppose  that  these  quantities  depend 
on  some  parameter  vector  0  €  Let  a(0)  be  the  total  expected  cost  for  the  first  T 
transitions,  where  T  is  fixed. 


17 


Simulation  is  often  the  most  convenient  tool  to  analyze  such  chains,  particularly  for 
very  large  state  spaces  (see  e.g.  [9]).  Note  that  here,  an  event  list  is  not  necessary  to  run 
the  simulation;  one  can  just  use  the  transition  probabilities  to  jump  from  state  to  state 
(see  [4]).  Typically,  the  transition  matrix  is  very  sparse,  and  from  any  given  state  t,  the 
number  of  reachable  states  is  small.  In  fact,  there  is  usually  no  need  to  write  down  that 
matrix,  neither  to  enumerate  S.  Take  for  instance  a  closed  Jackson  network  with  say  20 
nodes  and  100  customers  (one  server  per  node,  one  class  of  customers):  the  state  space 
is  huge,  but  from  any  given  state  with  say  B  busy  nodes,  there  are  only  B  possibilities 
for  the  node  where  the  departure  occurs  and  at  most  20  possibilities  for  the  destination 
node  of  the  departing  customer.  It  is  quite  easy  to  generate  the  two  corresponding  discrete 
variables  and  there  is  no  need  to  generate  explicitly  even  a  row  of  the  transition  matrix. 
(See  also  [4]  for  a  slightly  different  approach.) 

Let  {Xn,n  >  0}  be  the  embedded  Markov  chain  (the  sequence  of  visited  states),  0  = 
r0  <  Ti  <  T2  <  •  •  •  the  transition  times  (the  system  jumps  into  Xn  at  time  rn),  and  for 
n  >  0,  (n  =  Tn+i  —  Tn.  Note  that  £„  is  exponential  with  mean 

A  simple  choice  for  u?  is  u>  =  (-Xa,£o, -X’jjCi, . . .  ,JTt-i,Ct-i)*  Except  for  very  small 
T,  the  variance  of  the  gradient  estimator  is  then  usually  quite  high,  and  as  the  previous 
examples  suggest,  one  would  usually  prefer  to  use  IPA  if  it  is  applicable.  Unfortunately, 
IPA  rarely  works  for  the  transition  probabilities,  but  it  can  be  used  here  for  the  times 
between  jumps:  take  u;  =  (Jf0,  Uq , . . .  ,Xt~uUt~i),  where  U<  is  the  17(0, 1)  variate  used  to 
generate  (by  inversion).  This  could  certainly  help,  but  there  is  a  better  choice:  just  take 
u )  =  (Xo, . . . ,  Xr_x).  The  £<*$  can  simply  be  replaced  by  their  expectations.  This  reduces 
simultaneously  the  variance  of  the  cost  estimate  and  the  variance  of  the  likelihood  ratio. 
In  fact,  there  is  no  need  to  generate  any  £,.  The  cost  estimate  is  simply 

X)cx„/Ax„.  (9) 

n=0 

If  all  the  pij  depend  on  9,  there  are  still  T  —  1  terms  in  the  score  function  (assuming  Xo 
fixed),  and  the  variance  could  kill  us  for  large  T.  There  might  be  cases,  however,  where 
only  some  of  the  p./s  depend  on  9,  and  this  can  makes  a  big  difference  in  the  variance  of 
the  gradient  estimator.  If  only  the  transition  times  (and  not  the  transition  probabilities) 
are  influenced  by  0,  then  IPA  applies  and  the  gradient  estimator  is  readily  obtained  by 
differentiating  (9). 

Note  that  on  the  other  hand,  terms  in  the  likelihood  ratios  cannot  be  replaced  (in 
general)  by  their  expectations.  For  example,  the  score  function  in  (4)  has  zero  expectation, 
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but  is  correlated  with  h(9,u>)  and  the  expectation  of  the  product  is  not  zero.  There  are 
special  cases,  however,  where  it  can  be  done:  see  e.g.  Algorithm  B  in  [8]. 

With  minor  adaptations,  the  above  approach  also  applies  to  semi-Markov  processes, 
where  the  inter-jump  times  are  no  more  exponential,  and  to  the  case  where  T  is  a  random 
stopping  time. 

4.9.  Components  replacement 

(Taken  from  [11]).  Consider  a  system  comprised  of  N  identical  components,  that  evolve 
independently.  Each  component  has  a  random  lifetime  distribution  with  increasing  failure 
rate.  Whenever  a  component  fails,  it  must  be  replaced  instantly  by  a  new  one.  Other 
components  may  be  replaced  (preventively)  at  the  same  occasion.  The  repaiman  can 
also  halt  the  system  at  any  moment  and  replace  preventively  any  number  of  working 
components.  All  replacements  are  assumed  instantaneous.  A  failure  cost  c/  is  incurred 
every  time  a  component  fails.  At  each  intervention,  there  is  also  a  fixed  cost  c,,  and 
a  replacement  cost  which  is  cr  times  the  number  of  components  replaced.  Preventive 
replacements  are  made  to  avoid  some  of  the  failures,  and  replacements  are  sometimes 
lumped  together  to  pay  the  fixed  cost  less  often. 

Here,  we  restrict  our  attention  to  the  (generally  suboptimal)  class  of  policies  defined 
by  two  thresholds:  9\  >  02  >  0.  Whenever  a  component  fails  or  reaches  age  0\,  the 
repaiman  intervenes  and  replaces  all  components  older  than  02-  We  are  interested  in  the 
total  cost  for  a  fixed  duration  T,  assuming  that  all  components  are  new  at  the  beginning. 
The  parameter  here  is  6  =  (0j, #2). 

Unfortunately,  applying  LR  to  estimate  the  gradient  in  this  case  is  still  an  open  problem. 
Suppose  (j)  is  the  set  of  generated  component  lifetimes  (for  fixed  9 ,  this  is  enough  to  compute 
the  cost).  Then,  the  likelihood  ratio  is  always  one,  since  the  component  lifetimes  do  not 
depend  on  9 ,  but  k(9,u)  is  discontinuous  in  9.  More  specifically,  for  any  neighbourhood 
T  of  9  =  (01,02),  h(',u)  will  be  discontinuous  in  T  if  some  component  lifetimes  are  near 
enough  9\  or  02  to  change  the  sequence  of  failures  when  9  changes  inside  T  (for  fixed  ui), 
and  the  set  of  values  of  u>  for  which  this  happens  has  positive  probability.  Exactly  the  same 
problem  occurs  with  IPA  (u>  is  the  set  of  17(0,1)  values  used  to  generate  the  component 
lifetimes).  Suppose  now  that  u>  includes  the  sequence  of  all  failures  and  replacements,  with 
their  times.  Now,  h(0,u)  becomes  independent  of  9,  but  in  general,  the  Radon-Nikodym 
derivative  doesn’t  exists  in  a  neighbourhood  of  9,  since  a  typical  u>  will  have  non-zero 
probability  for  only  one  value  of  9. 
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For  the  moment,  we  don’t  know  how  to  apply  LR  or  IPA  to  estimate  the  gradient  for 
this  example.  Many  other  examples,  most  of  them  involving  “threshold”  parameters,  fall 
into  this  category.  For  instance,  think  of  a  ( s,  S)  inventory  systems,  where  6  =  (a,  $),  or  a 
time-sharing  computer  system  where  the  parameter  is  the  quantum  size,  or  a  checkpoint- 
rollback-recovery  system  (for  databases)  where  6  is  the  time  between  checkpoints  (or  is 
used  in  a  rule  to  decide  the  next  checkpoint  time,  based  on  the  state  of  the  system),  etc. 
At  present,  for  all  these  examples,  to  the  best  of  our  knowledge,  a  “finite  differences” 
approach  (preferably  with  common  random  numbers)  must  be  used. 
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5.  Conclusion 


We  pointed  out  the  strong  relationship  that  exists  between  IPA  and  LR.  Section  3  also 
provides  easily  verifiable  conditions  under  which  IPA  and/or  LR  apply.  When  they  do 
not  apply,  these  conditions  sometimes  permit  us  to  understand  why.  We  have  illustrated 
with  examples  some  ideas  related  to  this  approach.  In  particular,  there  are  often  many 
different  ways  to  implement  LR,  some  being  much  more  efficient  than  others.  IPA  and 
more  “traditional”  LR  (i.e.  as  used  for  instance  in  [8,  14,  15,  16])  can  sometimes  be 
combined  on  the  same  problem,  and  for  the  same  parameter.  However,  when  IPA  applies, 
it  is  typically  the  most  efficient  method. 

In  practice,  the  change  of  measure  used  to  define  (1-3)  can  sometimes  be  used  to 
reduce  the  variance  (importance  sampling  [17]).  We  have  not  explored  that  issue  in  this 
paper.  For  all  the  examples  in  section  4,  we  have  used  G(  )  =  P$0  to  estimate  Va(0o), 
but  substantial  variance  reductions  can  sometimes  be  obtained  by  using  a  different  (and 
appropriate)  (?(•). 

For  some  examples,  it  appears  that  finite  differences  (FD)  remains  the  only  applicable 
approach  at  this  time.  Some  experimental  evidence  ([13]  and  example  4  in  this  paper) 
suggests  that  FD  with  common  random  numbers  might  be  practically  as  good  as  IPA 
when  9  has  only  one  component  (d  =  1).  But  for  large  numbers  of  parameters  (large  value 
of  d ),  performing  all  the  simulations  required  for  FD  becomes  rather  time  consuming,  and 
a  good  LR  implementation  might  beat  FD  significantly. 
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